First, I chose the Iris dataset. As the programming language I picked R, and within it the H2O package. It is a bit of an 'overkill' for this task, but I found it useful to try out: at work we use a distributed system (Hadoop), and H2O can also process larger datasets stored in HDFS, so I combined one useful thing with another. It has a drawback too: few models are implemented. For example, SVM is missing, because it is harder to parallelize on a distributed system.
The Iris dataset is small, but I think it still allows measuring classifier accuracy fairly well, because one class is linearly separable from the other two, while the latter two are not separable from each other (see the figure below, drawn with 3 of the variables).
library(plotly)
plot_ly(iris,x=~Petal.Length,y=~Sepal.Length, z=~Petal.Width, color = ~Species, type="scatter3d",marker = list(opacity=0.5))
There are only 4 real-valued variables, holding the sepal and petal measurements of the flowers, plus the type of the flower. The dataset needs no preprocessing. The type can be converted with "one hot encoding" if we want to estimate one of the measurements from the other parameters. The dataset is part of the R environment, so there was no need to download it.
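As a minimal sketch of the one-hot encoding mentioned above, base R's `model.matrix` can expand `Species` into three indicator columns. The names `onehot` and `iris_encoded` are only illustrative and are not part of the original analysis.

```r
# One-hot encode the Species factor with base R (no H2O needed).
# `- 1` drops the intercept so that every class gets its own 0/1 column.
onehot <- model.matrix(~ Species - 1, data = iris)
colnames(onehot) <- levels(iris$Species)

# Combine the four measurements with the three indicator columns.
iris_encoded <- cbind(iris[, 1:4], onehot)
head(iris_encoded, 3)
```

The encoded frame could then serve as input when one of the four measurements is the regression target.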
library(h2o)
h2o.init() # connect to the H2O server, starting one if none is running
irisdf <-as.h2o(iris) # load the data into the server
summary(irisdf)
isplit<-h2o.splitFrame(irisdf,ratios = 4/5, destination_frames = c("train","test"),seed = 1) # split into training and test frames
itrain <- isplit[[1]]
itest <- isplit[[2]]
summary(itrain)
summary(itest)
The split is not exactly proportional (126 training and 24 test rows instead of 120 and 30), because the splitting method was designed for large datasets, where it works well.
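The inexact proportions come from the way the split is drawn. As far as the H2O documentation describes it, `h2o.splitFrame` uses a probabilistic method rather than cutting off an exact row count. The base R sketch below imitates that idea with one uniform random draw per row. It only illustrates the principle and is not H2O's actual implementation; the seed is arbitrary and unrelated to the H2O seed.

```r
set.seed(1)                     # arbitrary seed for reproducibility
n <- nrow(iris)

# Probabilistic split: each row goes to the training set with probability 0.8.
to_train <- runif(n) < 4/5
train <- iris[to_train, ]
test  <- iris[!to_train, ]
c(train = nrow(train), test = nrow(test))   # close to 120 / 30, but not guaranteed

# An exact 120 / 30 split samples row indices instead.
idx <- sample(n, size = round(n * 4/5))
exact_train <- iris[idx, ]
exact_test  <- iris[-idx, ]
```

On millions of rows the relative deviation of the probabilistic variant becomes negligible, and each row can be assigned independently, which suits a distributed setting.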
The 3 classifiers I chose are Naive Bayes, a neural network and Random Forest.
bayes <- h2o.naiveBayes(x=1:4,y=5,itrain)
bayes@model$training_metrics
## H2OMultinomialMetrics: naivebayes
## ** Reported on training data. **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.03200071
## RMSE: (Extract with `h2o.rmse`) 0.1788874
## Logloss: (Extract with `h2o.logloss`) 0.1098575
## Mean Per-Class Error: 0.04126016
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         45          0         0 0.0000 =  0 / 45
## versicolor      0         37         3 0.0750 =  3 / 40
## virginica       0          2        39 0.0488 =  2 / 41
## Totals         45         39        42 0.0397 = 5 / 126
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios:
## k hit_ratio
## 1 1 0.960317
## 2 2 1.000000
## 3 3 1.000000
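The summary figures in this report can be recomputed from the confusion matrix alone. The short base R sketch below does this for the Naive Bayes training matrix; the variable names are illustrative.

```r
# Training confusion matrix of the Naive Bayes model
# (rows: actual class, columns: predicted class).
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(45,  0,  0,
                0, 37,  3,
                0,  2, 39),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

per_class_error      <- 1 - diag(cm) / rowSums(cm)    # 0, 3/40, 2/41
mean_per_class_error <- mean(per_class_error)         # reported as 0.04126016
overall_error        <- 1 - sum(diag(cm)) / sum(cm)   # 5 / 126, reported as 0.0397
hit_ratio_top1       <- 1 - overall_error             # reported as 0.960317
```

The mean per-class error weights every class equally, so it differs slightly from the overall error whenever the classes have different sizes.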
bayes@model$apriori
## A Priori Response Probabilities:
## setosa versicolor virginica
## 1 0.357143 0.317460 0.325397
bayes@model$pcond
## [[1]]
## Sepal.Length:
## y_by_sepallength mean std_dev
## 1 setosa 5.006667 0.363318
## 2 versicolor 5.902500 0.546545
## 3 virginica 6.595122 0.621672
##
## [[2]]
## Sepal.Width:
## y_by_sepalwidth mean std_dev
## 1 setosa 3.437778 0.394444
## 2 versicolor 2.760000 0.324867
## 3 virginica 2.975610 0.287037
##
## [[3]]
## Petal.Length:
## y_by_petallength mean std_dev
## 1 setosa 1.460000 0.177610
## 2 versicolor 4.220000 0.483682
## 3 virginica 5.558537 0.566558
##
## [[4]]
## Petal.Width:
## y_by_petalwidth mean std_dev
## 1 setosa 0.255556 0.105649
## 2 versicolor 1.322500 0.213022
## 3 virginica 2.036585 0.277269
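The priors and the per-class means and standard deviations printed above are all a Gaussian Naive Bayes classifier needs. The sketch below applies Bayes' rule by hand with these numbers in base R. It is a simplified illustration: H2O's own scoring may differ in details (for example thresholds on small standard deviations), and the measurements in `flower` are made up.

```r
classes <- c("setosa", "versicolor", "virginica")
prior <- c(0.357143, 0.317460, 0.325397)

# Rows: classes; columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.
mu <- rbind(c(5.006667, 3.437778, 1.460000, 0.255556),
            c(5.902500, 2.760000, 4.220000, 1.322500),
            c(6.595122, 2.975610, 5.558537, 2.036585))
sigma <- rbind(c(0.363318, 0.394444, 0.177610, 0.105649),
               c(0.546545, 0.324867, 0.483682, 0.213022),
               c(0.621672, 0.287037, 0.566558, 0.277269))

nb_posterior <- function(x) {
  # log prior plus the sum of per-feature Gaussian log densities, per class
  log_joint <- sapply(seq_along(classes), function(k) {
    log(prior[k]) + sum(dnorm(x, mean = mu[k, ], sd = sigma[k, ], log = TRUE))
  })
  post <- exp(log_joint - max(log_joint))   # subtract the max for numerical stability
  setNames(post / sum(post), classes)
}

flower <- c(6.0, 2.9, 4.5, 1.5)             # hypothetical measurements
round(nb_posterior(flower), 4)
names(which.max(nb_posterior(flower)))
```

The "naive" part is the sum over features: each measurement contributes independently, so correlations between sepal and petal sizes are ignored.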
perfbayes <- h2o.performance(bayes,itest)
h2o.confusionMatrix(perfbayes)
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error     Rate
## setosa          5          0         0 0.0000 =  0 / 5
## versicolor      0         10         0 0.0000 = 0 / 10
## virginica       0          1         8 0.1111 =  1 / 9
## Totals          5         11         8 0.0417 = 1 / 24
nn <- h2o.deeplearning(x=1:4,y=5,itrain,hidden = c(10),epochs = 1000,diagnostics=TRUE,variable_importances = TRUE,export_weights_and_biases=TRUE) # one hidden layer with 10 units
nn@model$training_metrics
## H2OMultinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.01527013
## RMSE: (Extract with `h2o.rmse`) 0.1235724
## Logloss: (Extract with `h2o.logloss`) 0.04608933
## Mean Per-Class Error: 0.01646341
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         45          0         0 0.0000 =  0 / 45
## versicolor      0         39         1 0.0250 =  1 / 40
## virginica       0          1        40 0.0244 =  1 / 41
## Totals         45         40        41 0.0159 = 2 / 126
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios:
## k hit_ratio
## 1 1 0.984127
## 2 2 1.000000
## 3 3 1.000000
nn@model$model_summary
## Status of Neuron Layers: predicting Species, 3-class classification, multinomial distribution, CrossEntropy loss, 83 weights/biases, 4.0 KB, 126,000 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms
## 1     1     4     Input  0.00 %
## 2     2    10 Rectifier  0.00 % 0.000000 0.000000  0.177420 0.367138
## 3     3     3   Softmax         0.000000 0.000000  0.466268 0.485801
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1
## 2 0.000000   -0.125403   0.744704  0.525238 0.530411
## 3 0.000000   -0.673042   2.073248 -0.318591 0.525782
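The "83 weights/biases" and "126,000 training samples" in the summary follow directly from the layer sizes and the training settings. A small base R check, with illustrative variable names:

```r
layers <- c(input = 4, hidden = 10, output = 3)

# One weight per connection between consecutive layers, one bias per non-input unit.
n_weights <- sum(head(layers, -1) * tail(layers, -1))   # 4*10 + 10*3 = 70
n_biases  <- sum(tail(layers, -1))                      # 10 + 3 = 13
n_params  <- n_weights + n_biases                       # 83

# 126 training rows seen once per epoch, for 1000 epochs.
training_samples <- 126 * 1000                          # 126,000
```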
h2o.varimp_plot(nn)
h2o.weights(nn,matrix_id = 1)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 0.1368192 0.2493647 -0.7523624 -0.3132385
## 2 0.5641928 0.6266022 0.3723520 0.1660877
## 3 -0.4600934 0.6264148 -1.6257826 -0.7909308
## 4 -0.4742610 -0.3957320 0.7085864 -0.3115121
## 5 0.4626494 -0.1129947 -0.5199866 -0.9834628
## 6 -0.2412732 0.3762394 -1.6812392 -2.4659739
##
## [10 rows x 4 columns]
h2o.biases(nn,vector_id = 1)
## C1
## 1 1.04264398
## 2 1.14371285
## 3 0.01169682
## 4 0.12603429
## 5 1.25553962
## 6 0.19720104
##
## [10 rows x 1 column]
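To show how such weight matrices and bias vectors turn an input into class probabilities, here is a minimal forward pass of a 4-10-3 network with a Rectifier hidden layer and a Softmax output. The printout above lists only 6 of the 10 rows, so the sketch uses random, hypothetical weights instead of the trained ones, and it assumes the input has already been standardized (which, as far as I know, H2O does by default).

```r
set.seed(42)
W1 <- matrix(rnorm(10 * 4), nrow = 10)   # hypothetical hidden-layer weights (10 x 4)
b1 <- rnorm(10)                          # hypothetical hidden-layer biases
W2 <- matrix(rnorm(3 * 10), nrow = 3)    # hypothetical output-layer weights (3 x 10)
b2 <- rnorm(3)                           # hypothetical output-layer biases

forward <- function(x) {
  h <- pmax(as.vector(W1 %*% x) + b1, 0)   # Rectifier: max(0, Wx + b)
  z <- as.vector(W2 %*% h) + b2            # output-layer pre-activations
  p <- exp(z - max(z))                     # softmax, shifted for numerical stability
  setNames(p / sum(p), c("setosa", "versicolor", "virginica"))
}

x <- c(0.2, -0.5, 1.1, 0.9)   # made-up, already standardized measurements
forward(x)
```

With the real matrices from `h2o.weights` and `h2o.biases` (converted with `as.matrix`) in place of the random ones, the same function would presumably reproduce the model's predictions up to H2O's preprocessing.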
perfnn <- h2o.performance(nn,itest)
h2o.confusionMatrix(perfnn)
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error     Rate
## setosa          5          0         0 0.0000 =  0 / 5
## versicolor      0         10         0 0.0000 = 0 / 10
## virginica       0          0         9 0.0000 =  0 / 9
## Totals          5         10         9 0.0000 = 0 / 24
rf <- h2o.randomForest(x=1:4,y=5,itrain, ntrees = 40) # random forest with 40 trees
rf@model$training_metrics
## H2OMultinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.03224648
## RMSE: (Extract with `h2o.rmse`) 0.179573
## Logloss: (Extract with `h2o.logloss`) 0.1214671
## Mean Per-Class Error: 0.04939024
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         45          0         0 0.0000 =  0 / 45
## versicolor      0         37         3 0.0750 =  3 / 40
## virginica       0          3        38 0.0732 =  3 / 41
## Totals         45         40        41 0.0476 = 6 / 126
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios:
## k hit_ratio
## 1 1 0.952381
## 2 2 1.000000
## 3 3 1.000000
rf@model$model_summary
## Model Summary:
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              40                      120               15697         1
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         8    3.36667          2         13     5.45000
h2o.varimp_plot(rf)
perfrf <- h2o.performance(rf,itest)
h2o.confusionMatrix(perfrf)
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error     Rate
## setosa          5          0         0 0.0000 =  0 / 5
## versicolor      0         10         0 0.0000 = 0 / 10
## virginica       0          1         8 0.1111 =  1 / 9
## Totals          5         11         8 0.0417 = 1 / 24
gendata <- function(n, params){
  # params: list of four tables (one per feature, as in bayes@model$pcond),
  # each with one row per class and the columns `mean` and `std_dev`
  features <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
  classes <- c("setosa","versicolor","virginica")
  perclass <- lapply(seq_along(classes), function(k){
    # draw n independent normal values per feature for class k
    cols <- lapply(params, function(p) rnorm(n = n, mean = p$mean[k], sd = p$std_dev[k]))
    names(cols) <- features
    data.frame(cols, Species = classes[k], stringsAsFactors = FALSE)
  })
  do.call(rbind, perclass)
}
testd <- gendata(n=200,bayes@model$pcond) # 200 synthetic flowers per class; the features stay numeric
plot_ly(testd,x=~Petal.Length,y=~Sepal.Length, z=~Petal.Width, color = ~Species, type="scatter3d", mode="markers", marker = list(opacity=0.5))
td<-as.h2o(testd)
h2o.performance(nn,td)
## H2OMultinomialMetrics: deeplearning
##
## Test Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.04139032
## RMSE: (Extract with `h2o.rmse`) 0.2034461
## Logloss: (Extract with `h2o.logloss`) 0.1566785
## Mean Per-Class Error: 0.055
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error       Rate
## setosa        200          0         0 0.0000 =  0 / 200
## versicolor      0        188        12 0.0600 = 12 / 200
## virginica       0         21       179 0.1050 = 21 / 200
## Totals        200        209       191 0.0550 = 33 / 600
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-3 Hit Ratios:
## k hit_ratio
## 1 1 0.945000
## 2 2 1.000000
## 3 3 1.000000
h2o.performance(bayes,td)
## H2OMultinomialMetrics: naivebayes
##
## Test Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.01198691
## RMSE: (Extract with `h2o.rmse`) 0.1094847
## Logloss: (Extract with `h2o.logloss`) 0.04045009
## Mean Per-Class Error: 0.01666667
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error       Rate
## setosa        200          0         0 0.0000 =  0 / 200
## versicolor      0        193         7 0.0350 =  7 / 200
## virginica       0          3       197 0.0150 =  3 / 200
## Totals        200        196       204 0.0167 = 10 / 600
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-3 Hit Ratios:
## k hit_ratio
## 1 1 0.983333
## 2 2 1.000000
## 3 3 1.000000
h2o.performance(rf,td)
## H2OMultinomialMetrics: drf
##
## Test Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.02725207
## RMSE: (Extract with `h2o.rmse`) 0.165082
## Logloss: (Extract with `h2o.logloss`) 0.1012265
## Mean Per-Class Error: 0.035
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error       Rate
## setosa        200          0         0 0.0000 =  0 / 200
## versicolor      0        189        11 0.0550 = 11 / 200
## virginica       0         10       190 0.0500 = 10 / 200
## Totals        200        199       201 0.0350 = 21 / 600
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-3 Hit Ratios:
## k hit_ratio
## 1 1 0.965000
## 2 2 1.000000
## 3 3 1.000000
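The three evaluations on the 600 synthetic rows can be put side by side. The sketch below only re-enters the error counts from the confusion matrices above; the data frame `results` is an illustrative name.

```r
# Misclassification counts taken from the three confusion matrices (600 rows each).
results <- data.frame(
  model  = c("Neural network", "Naive Bayes", "Random forest"),
  errors = c(33, 10, 21),
  total  = 600
)
results$error_rate <- results$errors / results$total
results$accuracy   <- 1 - results$error_rate
results[order(results$error_rate), ]
```

Naive Bayes coming out ahead here is presumably no coincidence: the synthetic points were drawn from exactly the independent per-feature normal distributions that this model estimated, so the data matches its assumptions, while any correlation between features that the other two models picked up from the real data is absent.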
testd<-cbind(testd,as.data.frame(h2o.predict(nn,td)))
testd$val <- ifelse(testd$Species == testd$predict,"+","-")
plot_ly(testd,x=~Petal.Length,y=~Sepal.Length, z=~Petal.Width, color = ~val,colors=c("red","green"), type="scatter3d", mode="markers", marker = list(opacity=0.2),text=~paste(Species," ",predict))
nn <- h2o.deeplearning(x=2:4,y=5,irisdf,hidden = c(10),epochs = 1000,diagnostics=TRUE,variable_importances = TRUE,export_weights_and_biases=TRUE) # one hidden layer with 10 units, 3 features, full dataset
nn@model$training_metrics
## H2OMultinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("iris")`
## MSE: (Extract with `h2o.mse`) 0.01423088
## RMSE: (Extract with `h2o.rmse`) 0.1192933
## Logloss: (Extract with `h2o.logloss`) 0.04304078
## Mean Per-Class Error: 0.02
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         50          0         0 0.0000 =  0 / 50
## versicolor      0         48         2 0.0400 =  2 / 50
## virginica       0          1        49 0.0200 =  1 / 50
## Totals         50         49        51 0.0200 = 3 / 150
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios:
## k hit_ratio
## 1 1 0.980000
## 2 2 1.000000
## 3 3 1.000000
#nn@model$model_summary
h2o.varimp_plot(nn)
rf <- h2o.randomForest(x=1:4,y=5,itrain, ntrees = 40) # random forest with 40 trees
rf@model$training_metrics
## H2OMultinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.04175959
## RMSE: (Extract with `h2o.rmse`) 0.2043516
## Logloss: (Extract with `h2o.logloss`) 0.1569769
## Mean Per-Class Error: 0.04939024
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         45          0         0 0.0000 =  0 / 45
## versicolor      0         37         3 0.0750 =  3 / 40
## virginica       0          3        38 0.0732 =  3 / 41
## Totals         45         40        41 0.0476 = 6 / 126
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios:
## k hit_ratio
## 1 1 0.952381
## 2 2 1.000000
## 3 3 1.000000
rf@model$model_summary
## Model Summary:
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              40                      120               15503         1
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         7    3.24167          2         11     5.31667
h2o.varimp_plot(rf)
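# regular grid over three features for visualising the decision regions of rf
# (Sepal.Length is not part of the grid, although rf was trained on x=1:4)
x <- seq(0.1,5,by=0.1) # Sepal.Width
y <- seq(0.1,7,by=0.1) # Petal.Length
z <- seq(0.1,3,by=0.1) # Petal.Width
gr<-expand.grid(Sepal.Width=x,Petal.Length=y,Petal.Width=z)
grid<-as.h2o(gr,destination_frame = "grid") # upload the grid to the server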
prrf<-as.data.frame(h2o.predict(rf,grid))
plot_ly(gr,x=~Petal.Length,y=~Sepal.Width, z=~Petal.Width, color = prrf$predict, type="scatter3d", mode="markers", marker = list(opacity=0.2))
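The grid technique used for the last figure (predict a class for every point of a regular grid, then colour the points by prediction) can be tried without an H2O server. The sketch below uses a coarser grid and, as a stand-in, a nearest-centroid classifier in base R. This is not the random forest from the report; it only illustrates how decision regions are obtained from a grid.

```r
# Coarse grid over the same three features (step 0.5 keeps the example small).
gr <- expand.grid(Sepal.Width  = seq(0.5, 5, by = 0.5),
                  Petal.Length = seq(0.5, 7, by = 0.5),
                  Petal.Width  = seq(0.5, 3, by = 0.5))
feats <- names(gr)

# Stand-in classifier (NOT the random forest): nearest class centroid.
# Result is a features x classes matrix of per-class feature means.
centroids <- sapply(split(iris[feats], iris$Species), colMeans)

# Squared Euclidean distance of every grid point to each class centroid.
d2 <- sapply(colnames(centroids), function(cl) {
  rowSums(sweep(as.matrix(gr[feats]), 2, centroids[, cl])^2)
})

# Assign each grid point to the class with the smallest distance.
gr$predict <- colnames(d2)[max.col(-d2, ties.method = "first")]
table(gr$predict)
```

Passing `gr` with `color = ~predict` to `plot_ly` would give the same kind of 3D picture as above, with flat boundaries between the regions instead of the axis-aligned blocks typical of tree-based models.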